-
Notifications
You must be signed in to change notification settings - Fork 212
Misc GPU updates #2172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Misc GPU updates #2172
Conversation
eero-t
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits / comments.
| apt-get update && apt-get install --no-install-recommends -y ocl-icd-libopencl1 && \N | ||
| rm /runtime/level-zero-devel_*.deb && \N | ||
| cd /runtime && dpkg -i *.deb && rm -rf /runtime && \N | ||
| apt-get update && apt-get install --no-install-recommends -y ocl-icd-libopencl1 wget ca-certificates && \N |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the main size reduction from this duplication comes actually from removing the accidentally left (large) downloaded deb files, not from dropping l0-dev, wget, certs & their deps. Did you check that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It mainly comes from this:
COPY --from=builder /runtime /runtime
The runtime deb packages are copied from the build to the final phase, and while they are removed after the install the copy creates a large unnecessary layer. This is evident if you open the container in dive.
Would be nice if one could install packages directly from the build phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docker does not support host volumes for builds (needs extension), but podman does. What if packages were on a host tmp volume (-v $(mktemp -d):/temporary:rw), I don't think those go to the final image?
426558c to
f270956
Compare
eero-t
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. Additional container improvements don't need to be part of this PR.
Signed-off-by: Tuomas Katila <[email protected]>
f270956 to
51cf816
Compare
cmd/gpu_plugin/gpu_plugin.go
Outdated
| globalTempLimit := float64(dp.options.globalTempLimit) | ||
| memoryTempLimit := float64(dp.options.memoryTempLimit) | ||
| gpuTempLimit := float64(dp.options.gpuTempLimit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see this follows the existing functionality but I wonder if we could hide this float stuff befind the levelzero API and deal with (u)ints here. or is the cgo api used elsewhere too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temp is in Celsius, not in Kelvin, so (in some contrived corner-case) it could be also negative. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two options:
- Move to integer in the CGo portion. It would change the protobuf definition and would break compatibility between older and newer plugin&levelzero containers. Probably not a big thing, but something worth thinking.
- Move to integers in the levelzero-bridge golang code (inside the plugin).
I'll see how the changes would look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking something like 2. but it's not a blocker for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made the change.
Use the existing "temp-limit" as the global limit, and introduce GPU and memory thresholds. Signed-off-by: Tuomas Katila <[email protected]>
Fix uninitialized variable that caused random behaviour. Signed-off-by: Tuomas Katila <[email protected]>
By re-downloading the components, we save on the overall container size. While the build time increases slightly, the container size drops by around 100M (520->420). Signed-off-by: Tuomas Katila <[email protected]>
51cf816 to
d2e3c4b
Compare
Fixes #2166